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Unlabeled shape analysis is a rapidly emerging and challenging 
area of statistics. This has been driven by various novel applications 
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in bioinformatics. We consider here the situation where two configu- 
rations are matched under various constraints, namely, the configura- 
tions have a subset of manually located "markers" with high probabil- 
ity of matching each other while a larger subset consists of unlabeled 
points. We consider a plausible model and give an implementation 
using the EM algorithm. The work is motivated by a real experiment 
of gels for renal cancer and our approach allows for the possibility 
of missing and misallocated markers. The methodology is success- 
fully used to automatically locate and remove a grossly misallocated 
marker within the given data set. 

1. Introduction. 

1.1. Western Blots. Our motivating application concerns gel techniques 
used to identify proteins present in human tissue. First, two-dimensional 
electrophoresis (2-DE) is used to separate all the proteins extracted from 
a cell. The 2-DE gel is then probed with serum which contains antibodies 
that will bind to specific proteins. The image of a Western Blot will contain 
only the location (and intensity) of proteins that have a bound antibody. 
We can think of Western Blots as containing only a subset of the proteins 
that are displayed on 2-DE images. The extra step necessary to create a 
Western Blot allows a further level of variability within the final image. The 
reproducibility of Western Blots is therefore even more challenging than that 
of 2-DE images. To help align Western Blots, suitable marker proteins are 
experimentally determined and are generally expected to be present in all 
blots under investigation. A stain is applied to each blot which will highlight 
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Fig. 1. Western Blot image with red crosses depicting the subject-treatment specific non- 
markers. The larger black crosses indicate the labeled markers, with their acidity and mass 
measurements (not spatial coordinates) highlighted beneath. 



all proteins present, therefore enabling an expert to manually locate the suit- 
able markers. Figure 1 shows an annotated Western Blot image which shows 
the markers (with the acidity and mass measurements associated with these 
points) and further points detected by an image analyzer. The markers are 
used to align the blots by minimizing a sum of squared euclidean distances 
(usually not the acidity and mass measurements). In some cases, fine adjust- 
ments to alignments are made using various heuristic techniques. See, for 
example, Forgber et al. (2009) and Zvelebil and Baum [(2007), pages 613- 
620] for more details. 

Considering the large scope for variation between images and the often 
vast number of proteins located in a comparatively small area, visual exam- 
ination to analyze or compare images, although often informative, can be 
extremely difficult and conclusions unreliable. Visual comparison can also be 
extremely repetitive and laborious for the expert making the comparisons. 
Statistical and computational analysis is essential to the result accuracy and 
reduction of expert manual labor. The main aim is to locate a biomarker 
whose mere presence can be used to measure the progress of disease or treat- 
ment effects. In the case of the gel data, a point becomes a biomarker if it is 
found to have this property. The intensity of a biomarker, indicated by the 
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intensity of the mark on the image, can also provide information about the 
disease progression or treatment effect, but this is beyond the scope of this 
paper. 

1.2. Unlabeled configuration matching. In the more general setting, the 
problem is to match two sets — usually of unequal size — of points, in which 
the correspondence (matching) of the points is unknown. The solution will 
include the transformation required to align the sets, a list of correspon- 
dences which map (some of) the points, and will penalize solutions with 
many unmatched points, allowing for a trade-off in the goodness of fit in the 
aligned points. 

Approaches to closely related problems include the RANSAC algorithm 
[Fischler and Bolles (1981)], nonrigid point matching using thin-plate splines 
[Chui and Rangarajan (2003)], a correlation-based approach using kernels 
[Tsin and Kanade (2004), Chen (2011)], nonaffine matching of distributions 
[Glaunes, Trouve and Younes (2004)] and the Iterative Closest Point Algo- 
rithm [Besl and McKay (1992)] for the registration of various representations 
of shapes. All of these methods avoid making distributional assumptions, 
with a consequence that probabilistic statements are then difficult to make. 
By contrast, Czogiel, Dryden and Brignell (2011), Dryden, Hirst and Melville 
(2007), Kent, Mardia and Taylor (2010a), Taylor, Mardia and Kent (2003) 
and Green and Mardia (2006) use statistical models to obtain solutions. 
These latter papers all use examples drawn from protein bioinformatics; 
a review is given by Green et al. (2010). 

In this paper we address a more specific problem in which each config- 
uration contains a subset of points ("markers") whose labels correspond 
with high probability, with the remaining points having arbitrary labels 
(nonmarkers) as before. Suppose we have two configurations of observed 
landmarks in d dimensions: markers given hy Xj , j = 1, . . . , K and fXi, i = 
1, . . . ,K, and nonmarkers /ij, i = K-\-l, . . . ,K + m and Xj, j = K + 1,...,K + 
n. These are represented as matrices x{{K + n) x d) and ij.{{K + m) x d) 
in which K is usually smaller than m and n. In our model, the markers 
(the spatial coordinates of the large black crosses in Figure 1) and Xi for 
i = 1, . . . , K have been identified by an expert to correspond to the same 
proteins (referred to as a "points" hereafter). However, these are labeled 
with some uncertainty, so true correspondence is likely but not guaranteed. 
So it is possible, for example, that markers in fj, could correspond to non- 
markers in X, or have no correspondence at all. For fii and xj with i,j > K, 
(the spatial coordinates of the red crosses in Figure 1) we have no prior 
information about correspondence probabilities. 

1.3. Statistical model. A statistical model in the general setting involves 
three main components (see Figure 2): 
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Fig. 2. Illustration of the mam ingredients of a statistical model. The labels of the two 
configurations of points (x and fi) can be considered as arbitrary. Some of the x points are 
aligned to some of the fi points by a transformation (e.g., translation, rotation) which be- 
longs to a specified group. An 0/1 matrix M indicates which points match, with unmatched 
points in x (point 2 in the illustration) assigned to label "0," and a specific error model 
assumed for the magnitude of the residual after alignment. 



(a) A group Q, say, on M"' representing the permitted transformations {g) 
on (a subset of the landmarks of) fi to bring it close to (a subset of the 
landmarks of) x,g (zQ. 

(b) A matching matrix M, say, identifies which elements of x correspond 
to which elements of fi for the markers as well as unlabeled points. 

(c) An error model indicating how close the elements of x and will be, 
after the correct transformation and labeling are used. 

In Section 2 we introduce our statistical model and emphasize the group 
of affine transformations belonging to G which is relevant to our example. 
The appropriate matching matrix M is estimated under various scenar- 
ios, including the use of a matrix Q of prior probabilities, which is intro- 
duced to reflect the existence of the markers (labeled points) — an integral 
part of the specific problem. In Section 3 we outline likelihood based in- 
ference for M, and describe an EM algorithm. In Section 4 we adapt the 
prior matrix Q when either a marker is missing or a marker is wrongly 
identified. Two real examples are studied in Section 5 related to renal can- 
cer. In the first example, one marker is grossly misallocated and in the 
second example, some markers are missing. This procedure has great po- 
tential to automate preprocessing of the gels. We conclude with a discus- 
sion. 
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2. Statistical models. 

2.1. Transformations. Although the statistical model we later introduce 
can apply to various types of transformations, we focus on an affine transfor- 
mation of the form g{fi) = /lA' + B' , where A is a nonsingular d x d matrix 
and the d x 1 vector, b, is present in every column of the d x (K + m) 
matrix B. 

2.2. Matching matrix. To estimate the parameters of an appropriate 
transformation of /x, we can introduce a correspondence system that will in- 
dicate whether a point in /i is associated with a point in x, that is, whether 
two points match across configurations. We can record the correspondence 
information in a {K + m + 1) x (K + n) matching matrix, M, where 

{1, for i = if Xj does not have a matching point in /x, 
1, for i = 1, . . . , K -\- m if matches /ij, 
0, otherwise 

for j = 1, . . . , K + n. Note that, for simplicity of notation, we use Mqj = 
Mx+m+ij, and similarly for other matrices. If Mqj = 1, then Xj does not 
have a matching point in fi and we say that xj is unmatched. 

We consider one-to-one or many-to-one matches between points in x and 
points in ji. We refer to these as hard and soft matches, respectively. Soft 
matching can be useful in our application since a single protein can produce 
multiple spots on an image [Banks et al. (2000)]. 

Hard matches: The matching matrix, M, has the following constraints for 
the hard model: 

K+m 

(1) ^Mij = l for j = l,...,K + n 

i=0 

and 

K+n 

(2) ^Mij<l for i = l,...,A' + m. 

So for ii ^ 0, if Mi^j^ = 1, then Mj^jj = -^«2ii — *i ^2 and ji 7^ j2- 

Note that there are no constraints on row K + m + l in M since each of the 
K -\-n points in x is free to remain unmatched. Figure 2 illustrates the case 
of hard matches in which the point X2 is unmatched, so M02 = 1- 

Soft matches: For the soft model, the only constraint is stated in (1). That 
is, if Mjjj^ = 1, then Mi^j-^ = for all ii^ 12, but Mj^jj € {0, 1} for ji ^ j2- 
When assigning either hard or soft matches, (1) constrains a point in x to 
be matched to a single point in fi or, alternatively, to remain unmatched. 
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2.3. Error distribution. Assuming the transformation parameters, 
A and b, are known, we can apply a distribution to Xj given the match 
Mij = 1. Given the transformation, we treat the elements of x as condition- 



ally independent with the following densities for j = 1, . . 

1 ( 11'^? -^^i 



(3) 



p{xj\Mij 



1) 



(27r(j2)rf/2 



exp 



,K + n: 

b\?' 



1 
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for i - 



= 1, 
0, 
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where O is some region in containing all points in x. 

To allow for the possibility of soft matching, we consider points in x to be 
independent. As we have K markers in each image, we have prior information 
about the matching across images. Next we introduce notation to deal with 
prior matching probabilities. 

2.4. Prior matching matrix probabilities. Let Q be a {K + m+ 1) x {K + 
n) matrix with elements qij = p{Mij = 1). That is, for j = 1, . . . , K + n, qij 
is the prior probability that /ij is matched to Xj for i = 1, . . . ,K -\- m and 
the prior probability that xj is unmatched for i = 0. Again, for simplicity of 

notation, we use qoj in place of qK+m+i,j- Note that "^^^"^ qij = 1 for j = 
1, . . . ,K + n. We have prior knowledge that corresponding markers, fij and Xj 
for j = 1, . . . , K , should match. We propose a structure to determine the qij, 
which accounts for the possibility of error when allocating markers within 
a warped image and does not force corresponding markers to match. In what 
follows, it will be helpful to note that the matrix Q can be partitioned into 
submatrices of size (rows x columns) as follows: 

Qiil + K + m) X (K + n)) 

I QW(lxi^) I \ 



V 



Q(i)((K + m) X K) 

Markers in x: We know that fij are the coordinates for marker j in /i, 
j = 1, . . . , K . Let 7j be the index of the true marker j in fx. If -jj = j, then 
the marker j has been correctly identified. We set the prior probability of 
a point Hi being the true marker j, qij, to be a function of the distance 
between /i, and fj,j so that Q^^^ has elements 

(4) q^j =p{jj =i) = f{dij) for i = l,...,K + m,j = 1,...,K, 

where dij is the Euclidean distance between /ij and fij and choices for / are 
discussed later. 
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Main ingredients of the statistical model used for matching of 
partially labeled configurations of points. Section numbers [e.g., (3.1)] are 
used to sign-post further details or discussion 



Component of model 

Configurations x and /i 

Transformation group 
Matching matrix, M 



Variants 

Unlabeled (Section 1.2) 

Partially labeled 
Rigid-body (Section 2.1) 
Affine (Section 3.1) 
Nonlinear (Section 6) 
Hard (Section 6) 
Soft 



Examples 



Markers (Section 1.1) 



One-to-one 
Many-to-one (Section 6) 
Many-to-many (Section 6) 



Prior matrix, Q, 
with Qij = P{Mij = 1) 
which depends on 

- markers (Section 4) Function of distance (Section 3.3.1) 

- nonmarkers 

Error distribution Isotropic (Section 2.3) 

Nonlinear (Section 6) 



Next we consider the possibility that a marker within x does not have 
a corresponding point in fj,. Recall that Xj are the coordinates for marker j 
in X, j = 1, . . . , K . To allow for the possibility that xj remains unmatched, 
we set the prior probability of Mqj = 1 to be uniform so that Q^^^ has 
elements 

(5) = p{jj = 0) = -1-^ for j = 1,...,K, 

where ft is given as in (3). 

Nonmarkers in x: To allow for matching of the nonmarker points, we can 
set the elements of Q*-^-* as 

(6) gj=— — ^— — , i = 0,...,K + m,j = K + 1,...,K + n. 

K + m + l 

So the prior matching probability of a nonmarker Xj is uniform. 

As an example, we suppose that in Figure 2 only point 1 has been iden- 
tified as a marker in both x and ft, then we might have goi = 0.01 (= 
say), qii = 0.89, q^i = 0.01, q^i = 0.09, qyi = 0.00 (based on the interpoint 
distances within fx) and qij = 1/8 for the other points shown (taking m = 6 
in this example). 

For ease of reference, the ingredients of the statistical model, together 
with possible variations, are listed in Table 1. 
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3. EM algorithms and inference. 

3.1. EM algorithm. We use an EM algorithm [McLachlan and Krishnan 
(2008)] to estimate the transformation parameters, A and b, that will su- 
perimpose fi onto X. Throughout this section we assume that cj^ has been 
assigned (see Section 3.3.3). In the E-step we calculate the posterior prob- 
ability that Hi matches Xj, that is, the posterior probability that Mjj = 1. 
In the M-step the posterior probabilities are input into the expected like- 
lihood of observing M, given the data, x. This enables us to estimate the 
transformation parameters, A and b. 

E-step: We calculate the posterior probability of fii matching xj, given 
the data, using Bayes' theorem: 

, N p(xAMii = l)p(Mii = 1) 

p{Xj) 

where p{xj\Mij = 1) is calculated using (3), and qij = p{Mij = 1) is calcu- 
lated using (4)-(6). The denominator of (7) is given by J2i^"'^ Pi^jl^ij = 
l)xp{Mi, = l). 

M-step: Starting from the multinomial form [McLachlan and Krishnan 
(2008), page 15] 

K+m K+n 

l{M\x)= M^J\ogp{xJ), 

i=0 j=l 

we substitute pji for Mjj and qijp{xj\Mij = 1) for p{xj)to obtain the ex- 
pected log-likelihood of the matching matrix, M, given the data, x: 

K+m K+n 

(8) E[/(M|x)] =Y.Y. M^^ma + \ogp{xj\Mij = 1)]. 

Here, we suppress the dependence on the parameters A and h. 

Both the prior probabilities stored in Q and the conditional distribution 
of Xj being unmatched are independent of A and 6, so, using (8), we estimate 
the transformation parameters that maximize 

K+m K+n 

X] X] ^J'^°SP(a;il^ii = 1) 

i=l j=l 

^log(2vra^) 



2ct2 2 



Note that the final term is a constant, given that a is assumed known. 
Removing further terms independent of A and 6, we want to estimate the 
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transformation parameters that minimize 

K+mK+n 

i=i j=i 

Ignoring the terms independent of h, and noting that da'x/dx = a and 
dx'x/dx = 2x, the maximum hkeUhood estimates [Walker (2000)] are 



(9) 
and 

(10) 



A 



EK+m \-^K+n 
i=l 2^1=1 Pji 



K+mK+n 
. i=l j=l 



'K+mK+n 1 

i=l j=l 

The algorithm alternates between the E-step and the M-step. At each iter- 
ation, the transformation parameters are updated in the M-step to ^(^+^) = 
A^^'^ and 5^'''''^^ = , before substitution into the E-step for the next itera- 
tion. 

We assign convergence to be when r is such that 

K+m K+n 



(11) 



1 



{K + m+l){K + n) 



i=0 j=l 



(r+l) (r)-:2 



p^:t<io-', 



where / is chosen and the posterior probability of fii matching Xj at the 

rth and {r + l)st iteration is denoted by p^J'^ and P^j^i^^\ respectively, for 
i = 0, . . . ,K + m and j = 1, . . . , K + n. 

3.2. Inference for M . Let P be the {K + n) x {K + m + \) matrix con- 
taining the final posterior matching probabilities. Let A and h be the final es- 
timates of the transformation parameters obtained from the EM algorithm. 

An obvious route to estimate the matching matrix, M, is to use the 
posterior matching probabilities, but this will not yield a one-to-one out- 
come. For one-to-one matches we need to satisfy the constraints in (1) 
and (2). Given the transformation, the conditional log- likelihood of M is 
^^^^^f=i^ ^hj^ogPji. We find M that maximizes this log-likelihood by 
mixed integer linear programming. In our implementation we imputted the 
2K + m + n constraints into lp_solve [Berkelaar (2008)], which then yields 
the estimated one-to-one matching matrix, M. We can summarize the steps 
as follows. 
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Composite algorithm. 

(i) Assign qij using (4), (5) and (6) for i = 0, . . . , K + m and j = 1, . . . ,K + 
n. 

(ii) Find initial estimates of the transformation parameters, A^^'^ and b^^^ , 
and assign the variance, a^. Possible choices are discussed in the fol- 
lowing subsection. 

(iii) Run the EM algorithm to get the updated estimates, p^^^ , A^^^ and b^^^ , 
using (7), (10) and (9), respectively. 

(iv) Repeat step 3 to find the updated estimates, P^Ji^^^ , A^'^'^'^^ and , 
until convergence [defined in (11)] is reached. Let the final posterior 
matching probabilities be stored in the {K + n) x [K + m-\-\) matrix P 
and the final estimated transformation parameters be denoted by A and 
b. 

(v) One-to-one matches are obtained using the hardening algorithm de- 
scribed above. 

(vi) Treating the matches within the inferred matching matrix, M , as known, 
we can update the transformation parameters using Procrustes method- 
ology [Dry den and Mardia (1998)] to calculate the final estimates, A 
and b. 

3.3. Assigning the function and parameters within the EM algorithm. 
We need to assign the function / stated in (4), as well as starting values for 
the transformation parameters denoted by ^4^^^ and b^'^\ and a variance o"^. 
We look at each assignment separately. 

3.3.1. Distance function. As before, contains the allocated marker 
coordinates for marker j in ^u, j = 1, . . . ,K , and 7j is the index of the true 
marker j in /i. Let dij denote the expected distance between a point fii 
and for i = 1, . . . ,K -\- m. Due to the freedom for a gel to warp, in reality 
the distance between and /ij in an image is dij = dij +e, where e denotes 
some error. 

Our choice of the function, /, in (4), considers all points in /i as possible 
true markers. We adopt a multivariate normal distribution for e, which gives 

(12) Qij = pijj =i)(x exp 

for i = 1, . . . , K + m, where cr^ is the variance between two points in fi 
(assuming independence across dimensions). So the probability that /ij is 
the true marker j will decrease the further it is from fij. 

3.3.2. Starting values for transformation parameters. As we have prior 
knowledge of allocated corresponding markers in both and x, it is sensible 
that ^(0) and b^") are set as the transformation parameters necessary to best 
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superimpose corresponding markers. Dryden and Mardia (1998) show how 
these parameters can be estimated from the matrix, 

(13) i?=(/.:^.)-v>^"'\ 

where is the K x (d+l) matrix /i* = (i/^, ^''™'^) and 1^ is a vector of ones 
of length K. The K x d matrices, /i^"*) and x^"^\ contain only the marker 
coordinates for fj, and x, respectively. 

The first column in R' contains b^'^^ and the second two columns in R' 
contain the d x d matrix A^^^ . 

3.3.3. Starting values for the variance between images. We can estimate 
the variance <t^ by considering the mean squared distance between corre- 
sponding markers in and x after an affine transformation has been applied 
to superimpose them. That is, set 

(14) ^2^i^|j^._^{0)^._^{0)||2^ 

where v = dK — d? — d and denotes the degrees of freedom. Here dK is the 
number of error terms in the d components of the K markers. This number 
is reduced in v to accommodate the estimates of ^4^'^) and b^'^^ . 

4. Grossly misallocated or missing markers. This section describes fur- 
ther refinements to the above Composite Algorithm, which is highly de- 
pendent on the transformation parameters input as starting values, A^'^'^ 
and b^^\ We have previously stated that the affine transformation neces- 
sary to superimpose corresponding markers in /i and x will provide sensible 
starting values for the transformation parameters within the EM algorithm. 
However, this would not be the case if gross misallocations occur. The num- 
ber of missing or grossly misidentified markers are dependent on the quality 
of the equipment and the expert that creates the images. 

First, we provide a method that will highlight grossly misallocated mark- 
ers across images. Highlighted markers can then be automatically removed 
or corrected before they are used within the EM algorithm to estimate trans- 
formation starting values. Then, in Section 4.2 we deal with the case where 
some markers are missing from one of the images. 

4.1. Grossly misallocated markers. Gross misallocations of a marker may 
occur through human error when inputting marker labels into data spread- 
sheets. Dryden and Walker (1999) consider procedures based on S estima- 
tors, least median of squares and least quartile difference estimators that 
are highly resistant to outlier points. The RANSAC algorithm [Fischler and 
Bolles (1981)] uses a similar robust strategy. Here we describe how we can 
use the EM algorithm previously described. 
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Here we provide a method that will highlight grossly misallocated markers 
across images. Highlighted markers can then be automatically removed or 
corrected before they are used within the EM algorithm to estimate trans- 
formation starting values. 

Let /i^™) and x^^^ he K x d coordinate matrices where fij and xj contain 
the coordinates of marker j in /_f and x, respectively, for j = 1, . . . , K . Here we 
consider the prior matching probabilities to be independent of the distance 
between a possible marker and the allocated marker so that 

(PM, iori = j, 

(15) Qij = { 1 -PM f ., . 

\ , for«7^j, 

where pM denotes the probability that the allocated marker fXj truly corre- 
sponds to the allocated marker Xj. 

We input /x^™) and x^™^ into steps (i)-(v) of the composite algorithm to 
estimate the one-to-one matching matrix M, replacing (4) and (5) with (15) 
in stage (i). We use (13) to estimate the starting transformation values, 
A^^^ and b^^^ . Note that the starting transformation will be distorted by the 
presence of grossly misallocated markers. There are four possible outcomes 
for k = l,...,K: 

• The allocated corresponding markers Hk and Xk are matched if Mkk = 1- 
We include both and x/^. in further analyses. 

• The marker x^ remains unmatched if Mq^ = 1- We exclude both /Ufc and Xk 
from further analyses. 

• No point in x*-*"^ is matched to the marker /i^ if M^j = 0, for all j = 
1, . . . ,K. We exclude both /i^ and Xk from further analyses. 

• The marker fi^^ is matched to an allocated noncorresponding marker 

if Mfcifc2 — 1) fo'^ ^1 7^ ^2- We exclude Hki, /^fc2 5 Xk-^ and from further 
analyses. 

See Section 5.1 for an illustration. 

4.2. Missing markers. It is possible that all K markers are not success- 
fully located in both /i and x. For example, only 10 out of the possible 
K = 12 markers were located in the image displayed in Figure 1. 

There are four possible cases we must consider for Marker k = 1, . . . ,K: 
(a) located in both /i and x; (b) located in /i alone; (c) located in x alone; 
and (d) not located in either /i or x. We first introduce notation to allow for 
the possibility of missing markers. 

Let Kfj^ and be the total number of markers located in and x, 
respectively. As previously noted, let be the {K + m) x d coordinate matrix 
and X be the {K + n) x d coordinate matrix. 

If marker k is located in fi, then /u^ contains the coordinates of marker k 
in fi. If marker k is not located in /i, then = 0. Similarly, if marker k 
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is located in x, then Xk contains the coordinates of marker k in x, for k = 
1,. . . ,K. If marker k is not located in x, then = 0. 

As previously stated, Q is the {K + m + 1) x (K + n) matrix containing 
the prior matching probabilities for points in x. We define Q by allowing 
for the possibility that an allocated marker k is not the true marker k, for 
k = l,...,K. 

Markers in x: corresponding to each of the above cases we have: 

(a) If fij 7^ and Xj ^ 0, we assign qij as previously stated in (4) and (5) 
for i = 0, . . . ,K + m. 

(b) If Hj 7^ and Xj = 0, we treat nonmarker. 

(c) If Hj = and xj 7^ 0, we treat nonmarker. 

(d) If Hj = and xj = 0, we set qij = Qjk = for all i and k. 

Nonmarkers in x: The prior matching probability of a nonmarker, Xj, 
is again set to be uniform over all matching possibilities so that, for i = 
0, . . . ,K + m and j = K + 1, . . . ,K + n, 

^''^ '- = K, + m + l 

In case 3, when fij = and Xj 7^ for j = 1, . . . , K , we treat Xj as a non- 
marker and use (16) to calculate qij for i = 0, . . . , K + m. 

Note that /x contains markers and m nonmarkers. There are only K^-\- 
m + 1 matching possibilities for a point in x, thus producing the denominator 
in (16). See Section 5.2 for an illustration. 

5. Examples. Our full data set — see Supplementary Material [Mardia, 
Petty and Taylor (2012)] — was collected to represent eight subjects, under 
two different conditions, treated with two possible treatments. 

A replicate image was also produced for each subject-treatment specific 
case. A typical Western Blot is shown in Figure 1, which is approximately of 
size 280 x 220. In this paper we illustrate the methods on two pairs of images: 
in the first example, robustness to gross misidentification is explored, and 
the second example deals with missing markers. 

5.1. Grossly misallocated marker. Let // and x represent the coordinate 
sets on Western Blots of a renal cancer cell line cultured under either nor- 
moxic or hypoxic conditions. The proteins are then extracted and probed 
with either patient sera or control sera in a Western Blot to produce the 
images generated. All K = 12 markers were located in both images. 

We input the corresponding markers for and x into steps (i)-(v) of the 
composite algorithm (see Section 3.2) to estimate the one-to-one matching 
matrix, M, found when superimposing ^j^^' onto That is, we transform 
the appropriate markers in /x onto the corresponding markers in x. Using 
only the markers, we estimate the variance in (3) as cj^ = 4.5^ and set the 
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Fig. 3. Initial transformation, before (left) and after (right) marker 1 is removed as 
a marker from both images. 

prior matching probability in (15) as pM = 0.99. The starting values for the 
transformation parameters, A^'^^ and b^'^\ are found using (13). We use the 
final posterior probabilities, P, to estimate M. Marker 1 remains unmatched 
in both images. 

Figure 3 shows the initial transformation of fi onto x before and after 
marker 1 is removed as a marker (though still displayed) in both images. In 
this example, the RMSD between the 12 marker pairs before the removal is 
19.44. The RMSD between the remaining 11 marker pairs after the removal 
is 2.96. 

Following these discoveries, we were told that marker 1 was incorrectly 
labeled as spotID 136 when it should have been spotID 153, that is, the 
methodology was able to highlight a misidentified marker. 

5.2. Missing markers. In this example we display the matches made 
when comparing two replicate specimens, representing a cell line cultured 
under either normoxic conditions, with proteins extracted and probed with 
control sera. All 12 markers were located in /x. Markers 9 and 10 were missing 
in X, so these were treated as nonmarkers in /i and we set K = 10. 

We input the images into steps (i)-(v) of the composite algorithm. The 
starting values for the transformation parameters, and 6(0), are found 
using (13). We estimate the variance in (3), o"^, using (14) with denom- 
inator u. Finally, we set = o"^ in (12). The estimated transformation 
parameters are 

^_/0.980 -0.047\ 
~ V0.002 1.006 J 

and b = (—1.72,10.78)'. We display the matches made in Figure 4 after the 
final transformation of /i onto x. 



6. Discussion. Many EM algorithms are known to converge only to a lo- 
cal solution, and this will also apply to the methods considered here. How- 
ever, the availability of the markers which provide partial information will 
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Fig. 4. Final transformation of fi onto x and the matches made. Points in x (•), points 
in transformed fi (+), markers in x (V ) and markers m (^)- The 107 matched points 
across images are joined by a line. 



usually ensure good starting values, so this will not be a problem in our 
application. 

Note that it would be possible to adapt the model so that a could be 
allowed to vary according to the distance of the point to the edge of the 
image, which could be used to deal with minor nonlinear deformations. More 
generally, it should also be possible to adapt our methods to deal with 
more general transformations, for example, using thin-plate splines [Chui 
and Rangarajan (2003)]. 

There are situations when clusters occur within a gel which makes it 
difficult to correctly identify a marker within a cluster of points. We can allow 
for the increased likelihood that a marker fij,j = 1, . . . ,K, is misallocated if 
it exists within a cluster of other points, by using an adaptive choice of / in 
the prior (4): 



if dij < e, 
0, if dij > e. 



where dij is the Euclidean distance and Cj is the number of points in /x that 
are within a distance of e from fij, that is, 

K+m 

Cj= E ndij<e], 



1=1 



where I[dij <e] = l if dij < e, (0 otherwise) for i = 1, . . . ,K + m. 

A further adaptation of the model, which could be useful in Western Blots, 
would be to incorporate in the priors a measure associated with the grey- 
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scale intensity of the located points in the image [Rohr, Cathier and Worz 
(2004)]. Approaches for this, as well as further models for the background 
noise, are considered in Petty (2009). 

Our composite algorithm ensures one-to-one matches, but there are cir- 
cumstances in which many-to-one or many-to-many matches can be con- 
sidered. These can be useful when comparing protein images since multiple 
forms of an individual protein can often be visualized [Banks et al. (2000)]. 
That is, a single protein can produce multiple spots on an image. 

It should be noted that our model is asymmetric in and x. This is not 
uncommon; for example, the full Procrustes error is not symmetrical [see 
Dryden and Mardia (1998)]. Also, the standard RMSD used by bioinforma- 
tricians is again not a symmetrical measure. However, there are symmet- 
rical unlabeled shape analyses; see Green and Mardia (2006), for example. 
However, this method has not been developed for affine transformations 
and warping as required here. There is also a nonprobabilistic method of 
Rangarajan, Chui and Bookstein (1997) for similarity shape, but again the 
extension of the method to affine transformations and warping requires fur- 
ther work; see Kent, Mardia and Taylor (2010b) for a statistical framework. 
For the data considered here, we have verified that reversing the role of fi 
and X does not change the broad conclusions. 

Finally, we note that the methods described in this paper could have 
applications in other situations in which there are unlabeled points, some 
of which — possibly with error — have been manually identified. Thus, the 
method could be used in the preparation of ground truth for training an 
object recognition system or a pose estimation system; for example, see the 
survey of Murphy-Chutorian and Trivedi (2008). 

Acknowledgments. We would like to thank Roz Banks and Rachel Craven 
for providing us with gel data and general discussion concerning protein gels. 
We would also like to thank David Hogg for useful references about further 
applications. 

SUPPLEMENTARY MATERIAL 

Western Blot data (DOI: 10.1214/12-AOAS544SUPP; .gz). The supple- 
mentary data contains a zipped file which includes information taken from 
28 Western Blots. This represents 8 subjects (four controls and four patients) 
treated with two possible treatments. A replicate image is also obtained for 
each subject-treatment combination, though some replicates are missing. 
Further details are included in the associated README file. 
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